Text Area Identification in Web Images
Abstract
With the explosive growth of the World Wide Web, millions of documents are published and accessed online. Statistics show that a significant part of Web text information is encoded in Web images. Because Web images have special characteristics that sometimes distinguish them from other types of images, commercial OCR products often fail to recognize the text they contain. This paper proposes a novel Web image processing algorithm that locates text areas and prepares them for OCR so that recognition yields better results. Our methodology for text area identification has been fully integrated with an OCR engine and with an Information Extraction system. We present quantitative results for the performance of the OCR engine as well as qualitative results concerning its effects on the Information Extraction system. Experimental results obtained from a large corpus of Web images demonstrate the efficiency of our methodology.
Similar resources
A novel Web image processing algorithm for text area identification that helps commercial OCR engines to improve their Web image recognition efficiency
In this paper, a novel Web image processing algorithm is presented for text area identification. Statistics show that a significant part of Web text information is encoded in Web images. Since Web images have special characteristics that sometimes distinguish them from other types of images, commercial OCR products often fail to recognize Web images due to their special key characteristics. Thi...
Identifying Story and Preview Images in News Web Pages
The World Wide Web provides an increasingly powerful and popular publication mechanism. Web documents often contain a large number of images serving various different purposes. This paper focuses on images that are associated with a story or preview to a story. Such images often accompany the key content on a web page, thus their identification is important for applications such as web page sum...
Document Image Dewarping Based on Text Line Detection and Surface Modeling (RESEARCH NOTE)
Document images produced by a scanner or digital camera usually suffer from geometric and photometric distortions. Both deteriorate the performance of OCR systems. In this paper, we present a novel method to compensate for undesirable geometric distortions, aiming to improve OCR results. Our methodology is based on finding text lines by dynamic local connectivity map and then applying a l...
Exploring access to scientific literature using content-based image retrieval
The number of articles published in the scientific medical literature is continuously increasing, and Web access to the journals is becoming common. Databases such as SPIE Digital Library, IEEE Xplore, indices such as PubMed, and search engines such as Google provide the user with sophisticated full-text search capabilities. However, information in images and graphs within these articles is ent...
Web Image Annotation Using an Effective Term Weighting
The number of images on the World Wide Web has been increasing tremendously. Providing search services for images on the web has been an active research area. Web images are often surrounded by different associated texts like ALT text, surrounding text, image filename, html page title etc. Many popular internet search engines make use of these associated texts while indexing images and give hig...
Publication date: 2004